Introduction

Vehicle Attributes and Effects on MPG

Column

Motivation

The purpose of this study was to build a linear regression model from vehicle data in the 1985 Ward’s Automotive Yearbook, in hopes of uncovering relationships between MPG (miles per gallon) and other vehicle attributes.

By modeling city and highway MPG separately, I hoped to uncover differences in how the two rates are affected by the vehicle attributes in the data.

The dataset contains 205 observations, 6 of which were removed due to missing values, leaving 199 observations. The dataset originally contained 26 variables, but I chose to study only 15 of these, subsetting the data based on the amount of missing values and relevance to the research question.

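| highway | city | fueltype | aspiration | wheelbase | length | width | height | curbweight | enginesize | bore | stroke | compressionratio | horsepower | peakrpm |
|--------:|-----:|:---------|:-----------|----------:|-------:|------:|-------:|-----------:|-----------:|-----:|-------:|-----------------:|-----------:|--------:|
| 27 | 21 | gas | std | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 |
| 27 | 21 | gas | std | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 |
| 26 | 19 | gas | std | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 |
| 30 | 24 | gas | std | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 |
| 22 | 18 | gas | std | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 |
| 25 | 19 | gas | std | 99.8 | 177.3 | 66.3 | 53.1 | 2507 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 |
| 25 | 19 | gas | std | 105.8 | 192.7 | 71.4 | 55.7 | 2844 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 |
| 25 | 19 | gas | std | 105.8 | 192.7 | 71.4 | 55.7 | 2954 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 |
| 20 | 17 | gas | turbo | 105.8 | 192.7 | 71.4 | 55.9 | 3086 | 131 | 3.13 | 3.40 | 8.3 | 140 | 5500 |
| 22 | 16 | gas | turbo | 99.5 | 178.2 | 67.9 | 52.0 | 3053 | 131 | 3.13 | 3.40 | 7.0 | 160 | 5500 |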

Column

Variable Index

The following explanatory variables were the focus of my analysis:

  • Fuel Type: gas or diesel

  • Aspiration: standard (std) or turbo

  • Wheel Base: the horizontal distance (in.) between the centers of the front and rear wheels

  • Length: length (in.) of vehicle

  • Width: width (in.) of vehicle

  • Height: height (in.) of vehicle

  • Curb-weight: the published weight (lbs.) of a vehicle with a full tank of fuel and all fluids filled

  • Engine-size: the volume (cubic in.) of fuel and air that can be pushed through a car’s cylinders

  • Bore: diameter (in.) of engine’s cylinder

  • Stroke: distance (in.) the piston travels within the cylinder

  • Compression-ratio: ratio of the cylinder’s largest volume to its smallest volume over one piston stroke

  • Horsepower: the power an engine produces (1 hp = 550 ft-lbs per second)

  • Peak-RPM: the engine speed (revolutions per minute) at which peak horsepower is produced

EDA

Column

Highway

ggplot(vehicles, aes(x = highway)) + geom_histogram(color = "white")

City

ggplot(vehicles, aes(x = city)) + geom_histogram(color = "white")

Column

Explanation

Correlation Exploration

Column

Highway

City

Column

Explanation of Collinearity

LASSO

Column

Lambda Estimate for Highway Model

Reduced Highway MPG Model

14 x 1 sparse Matrix of class "dgCMatrix"
                           s0
(Intercept)       3.395918097
fueltype          .          
aspiration        .          
wheelbase         .          
length            .          
width             0.004117396
height            .          
curbweight       -0.190851601
enginesize       -0.019474681
bore              0.005619811
stroke            0.003983555
compressionratio  0.069151866
horsepower        .          
peakrpm          -0.012633210

Column

Estimated Lambda for City Model

Reduced City MPG Model

14 x 1 sparse Matrix of class "dgCMatrix"
                           s0
(Intercept)       3.195646783
fueltype          .          
aspiration        .          
wheelbase         .          
length           -0.007060336
width             .          
height            .          
curbweight       -0.145091194
enginesize        .          
bore              .          
stroke            .          
compressionratio  0.080345176
horsepower       -0.074876843
peakrpm          -0.002452953

Column

Explanation

Subset EDA

Column

City MPG

Highway MPG

Column

Peak RPM

Horsepower

Compression Ratio

Curb Weight

Length

Stroke

Width

LRM

Column

Explanation

Column

Linearity

Normality

---
title: "Final Project"
author: "Jesse Devitt"
output: 
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: cosmo
      primary: "blue"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---

<style>
.chart-title {  /* chart_title  */
   font-size: 20px;
  }
body{  /* Normal  */
      font-size: 18px;
  }
</style>

```{r setup, include=FALSE}
library(flexdashboard)
library(shiny)
library(shinydashboard)
```

Introduction
===
<head>
    <base target = "_blank">
</head>

<font size=5>
**Vehicle Attributes and Effects on MPG**
</font>


Column {data-width=650}
-----------------------------------------------------------------------

### Motivation

The purpose of this study was to build a linear regression model from vehicle data in the 1985 Ward's Automotive Yearbook, in hopes of uncovering relationships between MPG (miles per gallon) and other vehicle attributes.

By modeling city and highway MPG separately, I hoped to uncover differences in how the two rates are affected by the vehicle attributes in the data.

The dataset contains 205 observations, 6 of which were removed due to missing values, leaving 199 observations.
The dataset originally contained 26 variables, but I chose to study only 15 of these, subsetting the data based on the amount of missing values and relevance to the research question.

```{r}
knitr::opts_chunk$set(echo = TRUE)
library(pacman)
library(tidyverse)
library(plotly)
library(corrplot)
library(RColorBrewer)
library(stats)

# Read the raw file; the data mark missing values with "?", so parse them as NA
vehicles <- read.csv("~/Library/Mobile Documents/com~apple~CloudDocs/MTH 369/automobile/imports-85.data",
                     header = FALSE, na.strings = "?")

# Drop the 11 columns not used in this study and name the remaining 15
vehicles <- vehicles %>% dplyr::select(-c(V1, V2, V3, V6, V7, V8, V9, V15, V16, V18, V26))
names(vehicles) <- c("fueltype", "aspiration", "wheelbase", "length", "width", "height", "curbweight", "enginesize", "bore", "stroke", "compressionratio", "horsepower", "peakrpm", "city", "highway")

# Measured columns are numeric; the two categorical columns are factors
num_cols <- setdiff(names(vehicles), c("fueltype", "aspiration"))
vehicles[num_cols] <- lapply(vehicles[num_cols], as.numeric)
vehicles$fueltype <- as.factor(vehicles$fueltype)
vehicles$aspiration <- as.factor(vehicles$aspiration)

# Remove rows with any missing value
vehicles <- vehicles[complete.cases(vehicles), ]

# Put the two responses first, followed by the predictors
vehicles <- vehicles %>% dplyr::select(highway, city, everything())

# Standardize the 11 numeric predictors (columns 5-15) to mean 0 and SD 1
standardized <- apply(vehicles[, 5:15], 2, function(x) (x - mean(x)) / sd(x))
v <- vehicles %>% dplyr::select(highway, city, fueltype, aspiration)
stan_vehicles <- cbind.data.frame(v, standardized)
knitr::kable(vehicles[1:10, ])
```
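
The missing-value removal described above can be illustrated on a few toy rows. This is a minimal sketch, not the dashboard's own pipeline: the data frame `raw` and its values are made up for illustration, and it assumes `"?"` is the only missing-value marker.

```r
# Toy rows using "?" as the missing-value marker (illustrative values only)
raw <- data.frame(
  bore       = c("3.47", "?", "3.19", "3.13"),
  horsepower = c("111", "154", "?", "140"),
  stringsAsFactors = FALSE
)

# Recode the marker to NA first, then convert each column to numeric
raw[raw == "?"] <- NA
clean <- data.frame(lapply(raw, as.numeric))

# Keep only the rows with no missing values
complete <- clean[complete.cases(clean), ]

# Number of rows dropped
nrow(raw) - nrow(complete)
```

Recoding before calling `as.numeric()` avoids coercion warnings and makes the count of dropped rows easy to check against the totals reported in the text.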

Column {data-width=350}
-----------------------------------------------------------------------

### Variable Index

The following explanatory variables were the focus of my analysis:

- Fuel Type: gas or diesel

- Aspiration: standard (std) or turbo

- Wheel Base: the horizontal distance (in.) between the centers of the front and rear wheels

- Length: length (in.) of vehicle

- Width: width (in.) of vehicle

- Height: height (in.) of vehicle

- Curb-weight: the published weight (lbs.) of a vehicle with a full tank of fuel and all fluids filled 

- Engine-size: the volume (cubic in.) of fuel and air that can be pushed through a car's cylinders

- Bore: diameter (in.) of engine's cylinder

- Stroke: distance (in.) the piston travels within the cylinder

- Compression-ratio: ratio of the cylinder's largest volume to its smallest volume over one piston stroke

- Horsepower: the power an engine produces (1 hp = 550 ft-lbs per second)

- Peak-RPM: the engine speed (revolutions per minute) at which peak horsepower is produced

EDA
===

Column {.tabset data-width=650}
---

### Highway

```{r}
ggplot(vehicles, aes(x = highway)) + geom_histogram(color = "white")

```

### City

```{r}
ggplot(vehicles, aes(x = city)) + geom_histogram(color = "white")
```

Column {data-width=350}
---

### Explanation

Correlation Exploration
===

Column {.tabset data-width=650}
---

### Highway

```{r, echo=FALSE}
highwaynumeric <- vehicles %>% select(-c(fueltype, aspiration, city))
m1 <- round(cor(highwaynumeric), 2)
corrplot(m1, method = c("number"),type="upper",main="Highway MPG",mar=c(0,0,1,0), number.cex = 0.5)
```

### City

```{r, echo=FALSE}
citynumeric <- vehicles %>% select(-c(fueltype, aspiration, highway))
m <- round(cor(citynumeric), 2)
corrplot(m, method = c("number"),type="upper",main="City MPG",mar=c(0,0,1,0), number.cex = 0.5)
```

Column {data-width=350}
---

### Explanation of Collinearity
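
The correlation plots on this page show pairwise relationships only. One common complement, which this dashboard does not compute, is the variance inflation factor (VIF): for each predictor, regress it on the remaining predictors and take 1 / (1 - R²). The sketch below uses simulated data, not the vehicle data; the names `x1`, `x2`, `x3` and the sample size are hypothetical.

```r
set.seed(1)

# Simulated predictors (not the vehicle data): x2 is built to track x1 closely
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing x_j on the others
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  fit <- lm(reformulate(others, response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})

round(vif_manual, 2)
```

In this toy setup `x1` and `x2` should show large VIFs while `x3` stays near 1. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.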

LASSO
===

Column {data-width=300}
---

### Lambda Estimate for Highway Model

```{r, fig.align='center', echo=FALSE}
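# data.matrix() converts the two factor columns to integer codes;
# as.matrix() on mixed column types would return a character matrix
x<-data.matrix(stan_vehicles[,3:15])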
y1<-log(vehicles$highway)

set.seed(2000)
proportion_split<-0.7
train<-sample(1:nrow(x), round(nrow(x)*proportion_split))

y1.train<-y1[train]
y1.test<-y1[-train]

x.train<-x[train,]
x.test<-x[-train,]

library(glmnet)
set.seed(2000)
cv.lasso1<-cv.glmnet(x.train, y1.train, alpha = 1)
#cv.lasso1$lambda.min
plot(cv.lasso1)
```

### Reduced Highway MPG Model

```{r, fig.align='center', echo=FALSE}
model1<-glmnet(x.train, y1.train, alpha = 1, lambda = cv.lasso1$lambda.min)
coef1<-coef(model1)

#to compute training SSE from LASSO regression
y_predictedtrain1 <- predict(model1, s = cv.lasso1$lambda.min, newx = x.train)
SSEtrain1<-sum((y_predictedtrain1-y1.train)^2)
residuals1 <- y1.train - y_predictedtrain1 # residual = observed minus fitted

#Computing R-squared
SSTOtrain1<-sum((y1.train-mean(y1.train))^2)
R2train1<-1-SSEtrain1/SSTOtrain1

#to compute testing SSE from LASSO regression
y_predictedtest1 <- predict(model1, s = cv.lasso1$lambda.min, newx = x.test)
SSEtest1<-sum((y_predictedtest1-y1.test)^2)

#Computing R-squared
SSTOtest1<-sum((y1.test-mean(y1.test))^2)
R2test1<-1-SSEtest1/SSTOtest1

print(coef1)
```
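
Because the response is `log(highway)` and the numeric predictors are z-scored, each coefficient can be read as an approximate percent change in MPG per one-standard-deviation increase in that predictor, holding the others fixed. The sketch below copies the nonzero highway coefficients from the printed output. The vector name `beta` is introduced here for illustration, and LASSO estimates are shrunk toward zero, so these figures are rough guides and not unbiased effect sizes.

```r
# Nonzero highway coefficients as printed in the dashboard output
# (response is log(highway MPG); numeric predictors are z-scored)
beta <- c(
  width            =  0.004117396,
  curbweight       = -0.190851601,
  enginesize       = -0.019474681,
  bore             =  0.005619811,
  stroke           =  0.003983555,
  compressionratio =  0.069151866,
  peakrpm          = -0.012633210
)

# Multiplicative effect on MPG of a one-SD increase, expressed as a percent change
pct_change <- 100 * (exp(beta) - 1)
round(sort(pct_change), 1)
```

Under this reading, curb weight has the largest effect in magnitude: a one-SD increase corresponds to roughly 17% lower highway MPG.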

Column {data-width=300}
---

### Estimated Lambda for City Model

```{r, fig.align='center', echo=FALSE}
library(MASS)

#bc<-boxcox(city~peakrpm+horsepower+compressionratio+curbweight+length, data = vehicles)
#lambda<-bc$x[which.max(bc$y)]

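# data.matrix() converts the two factor columns to integer codes;
# as.matrix() on mixed column types would return a character matrix
x<-data.matrix(stan_vehicles[,3:15])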
y<-log(vehicles$city)

set.seed(2000)
proportion_split<-0.7
train<-sample(1:nrow(x), round(nrow(x)*proportion_split))

y.train<-y[train]
y.test<-y[-train]

x.train<-x[train,]
x.test<-x[-train,]

library(glmnet)
set.seed(2000)
cv.lasso<-cv.glmnet(x.train, y.train, alpha = 1)
#cv.lasso$lambda.min
plot(cv.lasso)
```

### Reduced City MPG Model

```{r, fig.align='center', echo=FALSE}
model<-glmnet(x.train, y.train, alpha = 1, lambda = cv.lasso$lambda.min)
coef<-coef(model)

#to compute training SSE from LASSO regression
y_predictedtrain <- predict(model, s = cv.lasso$lambda.min, newx = x.train)
SSEtrain<-sum((y_predictedtrain-y.train)^2)
residuals<-y.train - y_predictedtrain # residual = observed minus fitted

#Computing R-squared
SSTOtrain<-sum((y.train-mean(y.train))^2)
R2train<-1-SSEtrain/SSTOtrain

#to compute testing SSE from LASSO regression
y_predictedtest <- predict(model, s = cv.lasso$lambda.min, newx = x.test)
SSEtest<-sum((y_predictedtest-y.test)^2)

#Computing R-squared
SSTOtest<-sum((y.test-mean(y.test))^2)
R2test<-1-SSEtest/SSTOtest

print(coef)
```

Column {data-width=400}
---

### Explanation
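
For reference, here is a sketch of the criterion behind the fits on this page, written in the parameterization documented for `glmnet` with a Gaussian response and `alpha = 1`. The symbols $N$, $p$, $x_i$, and $\beta$ are notation introduced here, not names from the code.

$$
\min_{\beta_0,\,\beta}\ \frac{1}{2N}\sum_{i=1}^{N}\left(y_i-\beta_0-x_i^{\top}\beta\right)^2+\lambda\sum_{j=1}^{p}\left|\beta_j\right|
$$

Here $y_i$ is the log MPG of training vehicle $i$, $x_i$ holds its $p = 13$ predictor values, and $\lambda \ge 0$ sets the strength of the penalty. Larger values of $\lambda$ shrink more coefficients exactly to zero, which is why some rows of the printed coefficient tables show `.` instead of a number. `cv.glmnet` estimates prediction error over a grid of $\lambda$ values by cross-validation, and `lambda.min` is the value with the smallest mean cross-validated error.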

Subset EDA
===

Column {.tabset data-width=500}
---

### City MPG

```{r, fig.align='center', echo=FALSE}
#cfit <- lm(city~peakrpm+horsepower+compressionratio+curbweight+length, data = vehicles)
#summary(cfit)
```

### Highway MPG

```{r, fig.align='center', echo=FALSE}
#hfit <- lm(highway~peakrpm+horsepower+compressionratio+curbweight+length+stroke+width, data = vehicles)
#summary(hfit)
```

Column {.tabset data-width=500}
---

### Peak RPM
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = peakrpm)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("City MPG" = "blue", "Highway MPG" = "red"))
```

### Horsepower
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = horsepower)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("City MPG" = "blue", "Highway MPG" = "red"))
```

### Compression Ratio
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = compressionratio)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("City MPG" = "blue", "Highway MPG" = "red"))
```

### Curb Weight
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = curbweight)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("City MPG" = "blue", "Highway MPG" = "red"))
```

### Length
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = length)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("City MPG" = "blue", "Highway MPG" = "red"))
```

### Stroke
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = stroke)) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("Highway MPG" = "red"))
```

### Width
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = width)) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("Highway MPG" = "red"))
```


LRM
===

Column {data-width=350}
---

### Explanation

```{r, fig.align='center', echo=FALSE}

```

Column {.tabset data-width=650}
---

### Linearity

```{r, fig.align='center', echo=FALSE, out.width="50%"}
plot(residuals1~y_predictedtrain1, xlab = "Fitted Values", ylab = "Residuals", main = "Highway MPG", col = "red")
abline(h=0)

plot(residuals~y_predictedtrain, xlab = "Fitted Values", ylab = "Residuals", main = "City MPG", col = "blue")
abline(h=0)
```

### Normality

```{r, fig.align='center', echo=FALSE, out.width="50%"}
library(nortest)
qqnorm(residuals1)
qqline(residuals1)
#ad.test(residuals1)

qqnorm(residuals)
qqline(residuals)
#ad.test(residuals)
```
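
A Q-Q plot can be paired with a formal test. The chunk above leaves the Anderson-Darling test (`ad.test`) commented out; the sketch below swaps in the Shapiro-Wilk test from base R's `stats` package, applied to simulated residuals and not to the dashboard's residuals. The names `res_normal` and `res_skewed` are hypothetical.

```r
set.seed(42)

# Simulated stand-ins for model residuals (not the dashboard's residuals)
res_normal <- rnorm(150)       # drawn from a normal distribution
res_skewed <- rexp(150) - 1    # right-skewed, centered near zero

# Shapiro-Wilk test: small p-values suggest a departure from normality
p_normal <- shapiro.test(res_normal)$p.value
p_skewed <- shapiro.test(res_skewed)$p.value

c(normal = p_normal, skewed = p_skewed)
```

The skewed sample should give a very small p-value. With real residuals, the test is best read alongside the Q-Q plot, since large samples can flag departures that are too small to matter in practice.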